Other R packages for semantic similarity analysis are only based on gene annotations, thus mainly applicable on the Gene Ontology. Besides annotation-based methods, simona also supports a large number of methods based on the topology of GO DAG, with no need of external annotation data. In this document, we compare the semantic similarity calculated based on gene annotations and based on topology of the GO DAG.

We use Gene Ontology (Biological Process namespace) as the test ontology and human as the organism.

library(simona)
dag = create_ontology_DAG_from_GO_db(org_db = "org.Hs.eg.db")
dag
## An ontology_DAG object:
##   Source: GO BP / GO.db package 
##   27942 terms / 55956 relations
##   Root: GO:0008150 
##   Terms: GO:0000001, GO:0000002, GO:0000003, GO:0000011, ...
##   Max depth: 18 
##   Avg number of parents: 2.00
##   Aspect ratio: 363.92:1 (based on the longest distance to root)
##                 782.78:1 (based on the shortest distance to root)
##   Relations: is_a, part_of
##   Annotations are available.
## 
## With the following columns in the metadata data frame:
##   id, name, definition

Since some GO terms have no gene annotated and the information contents (IC) cannot be calculated, we random sample 500 GO terms that have genes annotated (i.e. with IC values).

set.seed(123)
ic = term_IC(dag, method = "IC_annotation")
ic = ic[!is.na(ic)]
go_id = sample(names(ic), 500)

Compare MICA and LCA

For the IC-based method (which is here annotation-based. We name it as “IC_annotation” for easier discussion), the similarity is determined by the IC of the most informative commmon ancestor (MICA) of two terms, while for the topology-based method, the similarity is determined by the lowest common ancestor (LCA). We compare, for a given set of GO terms, how different their MICA and LCA are.

We have already set the gene annotation when calling create_ontology_DAG_from_GO_db(), thus calling MICA_term() directly uses IC_annotation as the IC method. MICA_term() and LCA_term() returns a symmetric matrix, thus we only take the lower triangle of the square matrix.

term1 = MICA_term(dag, go_id)
term1 = term1[lower.tri(term1)]

term2 = LCA_term(dag, go_id)
term2 = term2[lower.tri(term2)]

In all 500*(500-1)/2 = 124750 pairs of terms, we check how many MICA and LCA are the same.

sum(term1 == term2)
## [1] 122463
sum(term1 == term2)/length(term1)
## [1] 0.9816673

It shows for more than 98% of the term pairs, their MICA and LCA are the same terms. For the remaining 2287 (2%) term pairs, we check how different their MICA and LCA are.

l = term1 != term2
term1 = term1[l]
term2 = term2[l]

Although the MICA and LCA of these 2287 term pairs are different, we ask are their MICA and LCA in ancestor/offspring relations? I.e. is the MICA an ancestor or offspring of the LCA? The function shortest_distances_directed() calculates directed distance in the DAG. If term \(a\) cannot reach term \(b\), taking directions into consideration, the distance value will be -1.

term_unique = unique(c(term1, term2))
dist = shortest_distances_directed(dag, term_unique) # a non-symmetric matrix with terms as names
x = numeric(length(term1))
for(i in seq_along(term1)) {
    x[i] = ifelse(dist[term1[i], term2[i]] >= 0, dist[term1[i], term2[i]], dist[term2[i], term1[i]]) 
}
table(x)
## x
##   -1 
## 2287

It shows for the 2287 term pairs in this random dataset, none of their MICA and LCA are in ancestor/offspring relations.

To demonstrate it, we can take the MICA and LCA of the first term pair as an example. Here the syntax dag[, c(term1[1], term2[1])] returns an induced sub-DAG only taking c(term1[1], term2[1]) as leaf terms. We color the MICA term in red, and the LCA term in blue. "GO:0008150" is the root of the DAG (biological_process). The following plot shows the MICA term “GO:0010035” and the LCA term "GO:1901701" although are not in ancestor/offspring relations, are still close in the DAG.

dag2 = dag[, c(term1[1], term2[1])]
color = c("red", "blue")
names(color) = c(term1[1], term2[1])

dag_graphviz(dag2, color = color)

Based on the example graph above, we guess it might be a general pattern that the MICA and LCA of the 2287 term pairs are in sibling relations, i.e. under a same parent term, or at least very close to each other. Next we calculate the distance between MICA and LCA taking DAG as undirected. The function shortest_distances_via_NCA() calculates the shortest distance of two terms passing through their nearest common ancestor (NCA).

term_unique = unique(c(term1, term2))
dist = shortest_distances_via_NCA(dag, term_unique) # a symmetric matrix with terms as names
x = numeric(length(term1))
for(i in seq_along(term1)) {
    x[i] = dist[term1[i], term2[i]]
}
barplot(table(x), xlab = "Undirected distance between MICA and LCA of a term pair", ylab = "Number of term pairs")

The barplot shows 76.3% of the 2287 term pairs, their MICA and LCA are siblings (i.e. sharing the same parent). MICA and LCA are also close in the DAG for other term pairs.

Semantic similarity based on MICA and LCA

In general, based on the results we have generated, we could conclude the the MICA method and the LCA method are very similar. In the most of the time, they are the same set or very close in the DAG.

If thinking depth as a generalized measure of information content where deeper in the DAG a larger value of of depth a term has, In this way, LCA is also a type of MICA. we compare the global IC_annotation and depth.

ic = term_IC(dag, method = "IC_annotation")
depth = dag_depth(dag)

par(mfrow = c(1, 2))
boxplot(ic ~ depth, xlab = "Depth", ylab = "IC_annotation")
barplot(table(depth), xlab = "Depth", ylab = "Number of terms")

It shows depth and IC_annotation have very strong positive relations. Together with the high compatible sets of MICA and LCA, it can be expected that the semantic similarities based on MICA and LCA should also be highly similar.

In the following example, we compare "Sim_Lin_1998" and "Sim_WP_1994" methods. "Sim_Lin_1998" is based on the IC_annotation of MICA and "Sim_WP_1994" is based on the depth of LCA. Note that the following two heatmaps share the same row orders and column orders.

mat1 = term_sim(dag, go_id, method = "Sim_Lin_1998")
mat2 = term_sim(dag, go_id, method = "Sim_WP_1994")

od = hclust(dist(mat1))$order

library(ComplexHeatmap)

Heatmap(mat1, name = "similarity_lin", 
        show_row_names = FALSE, show_column_names = FALSE,
        row_order = od, column_order = od,
        column_title = "Sim_Lin_1998, based on MICA/IC_annotation") +
Heatmap(mat2, name = "similarity_wp", 
        show_row_names = FALSE, show_column_names = FALSE,
        row_order = od, column_order = od,
        column_title = "Sim_WP_1994, based on LCA/depth")

The two heatmaps show very similar patterns. Heatmap under "Sim_WP_1994" in general shows overall stronger similarity signals, but also higher level of “inter-block” signals.

And the scatter plot of the similarity values from the two methods:

plot(mat1[lower.tri(mat1)], mat2[lower.tri(mat2)], pch = 16, col = "#00000010",
    xlab = "Sim_Lin_1998 / IC_annotation / MICA", ylab = "Sim_WP_1994 / depth / LCA")

Last, it is actually not strange that MICA/IC_annotation method is very similar to LCA/depth method, because IC_annotation is calculated based on the aggregation of gene annotations from offspring terms, which actually also takes account of the DAG topology.

Session info

sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.2.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] C/UTF-8/C/C/C/C
## 
## time zone: Europe/Berlin
## tzcode source: internal
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] ComplexHeatmap_2.16.0 knitr_1.44            rmarkdown_2.25       
## [4] simona_0.99.10       
## 
## loaded via a namespace (and not attached):
##  [1] KEGGREST_1.40.1         circlize_0.4.15         shape_1.4.6            
##  [4] rjson_0.2.21            xfun_0.40               bslib_0.5.1            
##  [7] visNetwork_2.1.2        htmlwidgets_1.6.2       GlobalOptions_0.1.2    
## [10] Biobase_2.60.0          vctrs_0.6.4             tools_4.3.1            
## [13] bitops_1.0-7            stats4_4.3.1            parallel_4.3.1         
## [16] Polychrome_1.5.1        AnnotationDbi_1.62.2    RSQLite_2.3.1          
## [19] cluster_2.1.4           blob_1.2.4              pkgconfig_2.0.3        
## [22] RColorBrewer_1.1-3      S4Vectors_0.38.2        scatterplot3d_0.3-44   
## [25] GenomeInfoDbData_1.2.10 compiler_4.3.1          Biostrings_2.68.1      
## [28] codetools_0.2-19        clue_0.3-65             GenomeInfoDb_1.36.4    
## [31] htmltools_0.5.6.1       sass_0.4.7              RCurl_1.98-1.12        
## [34] yaml_2.3.7              crayon_1.5.2            jquerylib_0.1.4        
## [37] GO.db_3.17.0            ellipsis_0.3.2          cachem_1.0.8           
## [40] org.Hs.eg.db_3.17.0     magick_2.8.0            iterators_1.0.14       
## [43] foreach_1.5.2           digest_0.6.33           fastmap_1.1.1          
## [46] colorspace_2.1-0        cli_3.6.1               magrittr_2.0.3         
## [49] DiagrammeR_1.0.10       bit64_4.0.5             XVector_0.40.0         
## [52] httr_1.4.7              matrixStats_1.0.0       igraph_1.5.1           
## [55] bit_4.0.5               png_0.1-8               GetoptLong_1.0.5       
## [58] memoise_2.0.1           evaluate_0.22           IRanges_2.34.1         
## [61] doParallel_1.0.17       rlang_1.1.1             Rcpp_1.0.11            
## [64] glue_1.6.2              DBI_1.1.3               xml2_1.3.5             
## [67] BiocGenerics_0.46.0     rstudioapi_0.15.0       jsonlite_1.8.7         
## [70] R6_2.5.1                zlibbioc_1.46.0